perf(l1): reduce BAL parallel-path overhead#6639
Conversation
|
Lines of code reportTotal lines added: Detailed view |
Benchmark Results ComparisonNo significant difference was registered for any benchmark run. Detailed ResultsBenchmark Results: BubbleSort
Benchmark Results: ERC20Approval
Benchmark Results: ERC20Mint
Benchmark Results: ERC20Transfer
Benchmark Results: Factorial
Benchmark Results: FactorialRecursive
Benchmark Results: Fibonacci
Benchmark Results: FibonacciRecursive
Benchmark Results: ManyHashes
Benchmark Results: MstoreBench
Benchmark Results: Push
Benchmark Results: SstoreBench_no_opt
|
75c47b2 to
e792377
Compare
Benchmark Block Execution Results Comparison Against Main
|
e792377 to
161f583
Compare
161f583 to
7e081a6
Compare
🤖 Claude Code ReviewNow I have enough context for a thorough review. PR Review:
|
🤖 Codex Code ReviewNo blocking correctness or security findings from static review. The deferred BAL-validation ordering and the trie-worker idle handshake both look coherent. Non-blocking perf note:
Testing gap:
Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
Greptile SummaryThis PR is a performance bundle for the BAL parallel-execution path rebased onto
Note: change B in the PR description ( Confidence Score: 4/5The parallel execution path changes are logically correct and well-tested; the two minor concerns do not affect production block processing. Core logic is sound. Main risk is the implicit one-batch invariant in handle_merkleization_bal — a future modification sending a second batch would silently produce a wrong state root. A debug_assert after the Ok arm would fully mitigate this. crates/blockchain/blockchain.rs (single-recv invariant and error-path Stage C) and crates/vm/backends/levm/mod.rs (deferred BAL validation fast paths)
|
| Filename | Overview |
|---|---|
| crates/blockchain/blockchain.rs | Stage A changed from channel drain to single rx.recv(); allows Stage B to overlap with parallel exec. Introduces a fragile one-batch invariant and triggers Stage C (16 trie opens) on the error path. |
| crates/vm/backends/levm/mod.rs | BAL validation moved into rayon closures, code_cache pre-computed, per-tx DB capacity capped, two fast paths added to bal_to_account_updates. Logic correct; deferred-error order preserved. |
| crates/storage/store.rs | Adds TrieMessage enum with Ping variant; wait_for_persistence_idle() correctly exploits sync_channel(0) rendezvous to signal worker idle state. |
| crates/vm/levm/src/db/mod.rs | RwLock replaced with DashMap<_, _, FxBuildHasher> for all three caches; prefetch methods simplified with par_iter + DashMap entry API. |
| cmd/ethrex/cli.rs | Magic-number sleep replaced by wait_for_persistence_idle(); bench-tool only change with no production effect. |
| crates/vm/levm/Cargo.toml | Adds dashmap 6.1 as a direct dep rather than a workspace dep; minor consistency nit. |
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
crates/blockchain/blockchain.rs:870-882
**Single-recv invariant has no defensive check**
`handle_merkleization_bal` now consumes exactly one message and then never touches `rx` again. If `execute_block_parallel` is ever modified to send a second batch, the extra message will sit in the unbounded channel and be silently dropped — the merkleizer will proceed with only the first batch, producing a wrong state root with no error. A defensive `debug_assert!(rx.try_recv().is_err(), "expected exactly one batch from execute_block_parallel")` after the `Ok` arm would catch any accidental protocol change during development.
### Issue 2 of 3
crates/blockchain/blockchain.rs:876-881
**`Err(_)` path continues through Stage C (16 trie opens)**
When the channel is closed without a message (execution failure before `bal_to_account_updates`), the function returns `Vec::new()` and falls through to Stage C, which unconditionally spawns 16 threads to open the parent state trie even though all shards will have no items. Returning an empty `AccountUpdatesList` early in the `Err` arm would avoid this overhead without changing the visible behaviour, since the execution error surfaces via `execution_result?` regardless.
### Issue 3 of 3
crates/vm/levm/Cargo.toml:23
`dashmap` is added as an inline version requirement rather than routing through the workspace. Nearly every other dep in this file uses `workspace = true`. A workspace entry ensures the version is bumped in one place and keeps the lockfile diff minimal.
```suggestion
dashmap.workspace = true
```
Reviews (1): Last reviewed commit: "perf(l1): pre-size backup_storage_slot i..." | Re-trigger Greptile
| let updates: Vec<AccountUpdate> = match rx.recv() { | ||
| Ok(updates) => { | ||
| let current_length = queue_length.fetch_sub(1, Ordering::Acquire); | ||
| *max_queue_length = current_length.max(*max_queue_length); | ||
| updates | ||
| } | ||
| } | ||
| Err(_) => { | ||
| // Channel closed without a message — execution failed before | ||
| // bal_to_account_updates ran. Return empty work so the exec | ||
| // error surfaces in execution_result rather than being masked. | ||
| Vec::new() | ||
| } | ||
| }; |
There was a problem hiding this comment.
Single-recv invariant has no defensive check
handle_merkleization_bal now consumes exactly one message and then never touches rx again. If execute_block_parallel is ever modified to send a second batch, the extra message will sit in the unbounded channel and be silently dropped — the merkleizer will proceed with only the first batch, producing a wrong state root with no error. A defensive debug_assert!(rx.try_recv().is_err(), "expected exactly one batch from execute_block_parallel") after the Ok arm would catch any accidental protocol change during development.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/blockchain/blockchain.rs
Line: 870-882
Comment:
**Single-recv invariant has no defensive check**
`handle_merkleization_bal` now consumes exactly one message and then never touches `rx` again. If `execute_block_parallel` is ever modified to send a second batch, the extra message will sit in the unbounded channel and be silently dropped — the merkleizer will proceed with only the first batch, producing a wrong state root with no error. A defensive `debug_assert!(rx.try_recv().is_err(), "expected exactly one batch from execute_block_parallel")` after the `Ok` arm would catch any accidental protocol change during development.
How can I resolve this? If you propose a fix, please make it concise.| Err(_) => { | ||
| // Channel closed without a message — execution failed before | ||
| // bal_to_account_updates ran. Return empty work so the exec | ||
| // error surfaces in execution_result rather than being masked. | ||
| Vec::new() | ||
| } |
There was a problem hiding this comment.
Err(_) path continues through Stage C (16 trie opens)
When the channel is closed without a message (execution failure before bal_to_account_updates), the function returns Vec::new() and falls through to Stage C, which unconditionally spawns 16 threads to open the parent state trie even though all shards will have no items. Returning an empty AccountUpdatesList early in the Err arm would avoid this overhead without changing the visible behaviour, since the execution error surfaces via execution_result? regardless.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/blockchain/blockchain.rs
Line: 876-881
Comment:
**`Err(_)` path continues through Stage C (16 trie opens)**
When the channel is closed without a message (execution failure before `bal_to_account_updates`), the function returns `Vec::new()` and falls through to Stage C, which unconditionally spawns 16 threads to open the parent state trie even though all shards will have no items. Returning an empty `AccountUpdatesList` early in the `Err` arm would avoid this overhead without changing the visible behaviour, since the execution error surfaces via `execution_result?` regardless.
How can I resolve this? If you propose a fix, please make it concise.| strum = { version = "0.27.1", features = ["derive"] } | ||
| rustc-hash.workspace = true | ||
| rayon.workspace = true | ||
| dashmap = "6.1" |
There was a problem hiding this comment.
dashmap is added as an inline version requirement rather than routing through the workspace. Nearly every other dep in this file uses workspace = true. A workspace entry ensures the version is bumped in one place and keeps the lockfile diff minimal.
| dashmap = "6.1" | |
| dashmap.workspace = true |
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/vm/levm/Cargo.toml
Line: 23
Comment:
`dashmap` is added as an inline version requirement rather than routing through the workspace. Nearly every other dep in this file uses `workspace = true`. A workspace entry ensures the version is bumped in one place and keeps the lockfile diff minimal.
```suggestion
dashmap.workspace = true
```
How can I resolve this? If you propose a fix, please make it concise.Bundle of independent improvements to the BAL parallel-execution path
(execute_block_parallel + handle_merkleization_bal + warm_block_from_bal +
CachingDatabase), validated against a 149-block stress fixture (100M gas,
200-500 tx/block, ~25M-gas median blocks).
The changes (each is independently shippable; combined here for atomic
review since they touch overlapping code):
A. handle_merkleization_bal overlap fix (crates/blockchain/blockchain.rs)
`for updates in rx { ... }` blocked until channel close (= exec end).
execute_block_parallel sends exactly one batch up front from
bal_to_account_updates, so draining nothing useful serialized Stage B
(parallel storage roots) after exec instead of overlapping with it.
Replaced with a single rx.recv() and dropped the FxHashMap merge step
(BAL guarantees one entry per address).
C. import-bench inter-block sleep 500ms -> 100ms (cmd/ethrex/cli.rs)
Bench tooling change. The sleep gates background trie-layer writeback
from bleeding into the next block's per-block timer; 100ms is well
above measured Phase 2 cost on SSD. Cuts bench wall clock 80% without
affecting the per-block metric. NO effect on production paths.
Q1. Skip prestate read in bal_to_account_updates when BAL covers all info
fields (crates/vm/backends/levm/mod.rs). Two fast paths added:
storage-only updates (info: None, removed: false by construction);
full info coverage with non-empty post (removal impossible, info from
BAL alone). Slow path keeps existing behavior for partial coverage.
Q2. Per-tx GeneralizedDatabase capacity cap at 32
(crates/vm/backends/levm/mod.rs::execute_block_parallel). Previously
sized to bal.accounts().len() (often 100s on stress blocks); p50 tx
touches <10 accounts. Reduced allocator pressure across rayon workers.
Q3. Memoize code_from_bal results across seed_db_from_bal calls
(crates/vm/backends/levm/mod.rs). Pre-compute Code objects (hash +
jump_targets) once per BAL code change before the par_iter; pass cache
via optional param to seed_db_from_bal. Saves N-1 keccak+jump-target
scans per code change per block (N = tx count).
Q8. Move per-tx BAL validation into the rayon par_iter closure
(crates/vm/backends/levm/mod.rs::execute_block_parallel). Eliminates a
serial post-exec validation pass (~3 ms median across 200 txs). Drops
current_state and codes inside the closure after validation runs —
they no longer cross the rayon boundary, reducing per-tx allocator
pressure. Closure returns deferred Option<EvmError> so gas-limit check
still takes priority over BAL mismatch errors.
DashMap. CachingDatabase RwLock<HashMap> -> DashMap<_, _, FxBuildHasher>
(crates/vm/levm/src/db/mod.rs). Found via perf record: 11% of CPU was
RwLock::read_contended on the single account RwLock with 16 rayon
workers hammering it. Sharded concurrent map (64 default shards)
eliminates contention. Sequential paths unaffected (only 2 threads
access the cache, weren't contended).
Effect on non-BAL paths (block production, pre-Amsterdam, sequential
fallback): DashMap is neutral (low contention); other changes only fire
on the BAL parallel-validation path. No regressions in non-parallel paths.
Q8 in the BAL parallel-path perf bundle (7e081a6) moved per-tx BAL validation into the rayon closure. As part of the refactor the `unaccessed_pure_accounts.remove(&header.coinbase)` call was hoisted out of the per-tx loop to run unconditionally on every parallel-path invocation. For 0-tx blocks (empty / withdrawal-only on Amsterdam+) that unconditional removal silently exempts a BAL entry the protocol calls extraneous: fee finalization never runs without a tx, so geth's readerTracker never touches the coinbase either. A BAL coinbase entry on such a block is by construction extraneous and must surface as a validation error. Restoring the original gate (only exempt when at least one tx ran) re-rejects the block. Verified against EELS test_bal_invalid_extraneous_coinbase[empty_block] and [withdrawal_only].
CallFrameBackup::original_account_storage_slots starts each fresh account's inner FxHashMap at capacity 0. The first few SSTOREs in any new tx trigger hashbrown::reserve_rehash 3-4 times in sequence (0 → 4 → 8 → 16). perf record on a 460-block bal-devnet-7 mainnet-mix fixture (200 tx/block, ~65 Mgas) showed hashbrown::reserve_rehash as the 7th hottest leaf at 3.02B samples. After pre-sizing to 8 the same leaf drops to 2.19B, a 27% reduction in that frame and ~0.8% of total CPU recovered. Wall-clock impact is sub-noise on this workload (per-tx CPU savings happen inside rayon workers; wall-clock is bound by the longest tx per block) but the CPU savings compound on heavier workloads where critical-path txs hit the rehash chain. Wastes ~256 B per untouched account; negligible.
51a768d to
4c4aba3
Compare
|
Closing. Of the original bundle:
|
Summary
Rebased onto current
main(post bal-devnet-7-pr squash merge). Scope reduced from the original bundle: dropped theCachingDatabaseRwLock<HashMap>→DashMapswap and the 500ms → 100msimport-benchsleep tweak. Both require coordinated zkVM-feature-gating work that isn't done here.Remaining changes target the BAL parallel-execution path (
execute_block_parallel+handle_merkleization_bal+seed_db_from_bal). All changes are within already rayon-gated code paths -- zkVM builds (sp1/risc0/zisk/eip-8025) are unaffected.What is in this PR
handle_merkleization_baloverlap fix (crates/blockchain/blockchain.rs):for updates in rx { ... }blocked until channel close (= exec end).execute_block_parallelsends exactly one batch up front frombal_to_account_updates, so draining nothing useful serialized Stage B (parallel storage roots) after exec instead of overlapping with it. Replaced with a singlerx.recv()and dropped theFxHashMapmerge step (BAL guarantees one entry per address).bal_to_account_updateswhen BAL covers all info fields (crates/vm/backends/levm/mod.rs): two fast paths added -- storage-only updates (info: None, removed: false by construction); full info coverage with non-empty post (removal impossible, info from BAL alone). Slow path keeps existing behavior for partial coverage.GeneralizedDatabasecapacity cap at 32 (execute_block_parallel): previously sized tobal.accounts().len()(often 100s on stress blocks); p50 tx touches <10 accounts. Reduced allocator pressure across rayon workers.code_from_balresults acrossseed_db_from_balcalls: pre-computeCodeobjects (hash + jump_targets) once per BAL code change before the par_iter; pass cache via optional param toseed_db_from_bal. Saves N-1 keccak+jump-target scans per code change per block (N = tx count).execute_block_parallel): eliminates a serial post-exec validation pass (~3 ms median across 200 txs). Dropscurrent_stateandcodesinside the closure after validation runs -- they no longer cross the rayon boundary, reducing per-tx allocator pressure. Closure returns deferredOption<EvmError>so gas-limit check still takes priority over BAL mismatch errors.unaccessed_pure_accountscoinbase removal on!exec_results.is_empty()to fix 0-tx-block regression (Q8 had hoisted the removal outside the per-tx loop, silently exempting coinbase on empty/withdrawal-only blocks). Found via EELStest_bal_invalid_extraneous_coinbase[empty_block|withdrawal_only].backup_storage_slotinner map (separate commit): avoids rehash chain growth in the slot backup tracker.What was dropped from the original bundle (deferred to follow-ups)
CachingDatabaseRwLock → DashMap: theCargo.tomlrayon gating from fix(l1): re-enable rayon feature on blockchain crate #6662 makes the levm crate's rayon dep optional (zkVM builds skip rayon entirely). The DashMap swap rewrites struct fields and methods unconditionally; to land cleanly it needs paired zkVM-fallback variants matching the existing#[cfg(all(feature = "rayon", not(feature = "eip-8025")))]pattern. Split out to its own PR.import-bench500ms → 100ms sleep: superseded by feat(l1): improve rlp import #6666 which replaces the sleep withStore::wait_for_persistence_idle()entirely.Headline numbers (from the original bundle, before reduced scope)
460-block mainnet-mix fixture from bal-devnet-7 kurtosis localnet (~670 tx/s spamoor mix, 65 Mgas median, ~200 tx/block, 452/460 blocks carry BAL):
Numbers above include the DashMap change; expect a regression of a few % once that's removed (perf record showed 11% CPU on the RwLock contention -- DashMap removed that; without it some of that comes back, exact figure pending re-bench).
Test plan
cargo check -p ethrex(default features) -- cleancargo check -p ethrex-levm --no-default-features --features sp1(zkVM mode, no rayon) -- cleancargo fmt --check-- cleanmake test, EF tests, Hive Amsterdam